arXiv: 1501.04206v2 [math.ST] 4 Feb 2015 


A note on boundary kernels for distribution function estimation


Carlos Tenreiro* 
January 17, 2015 


Abstract 

The use of second order boundary kernels for distribution function estimation was
recently addressed in the literature (C. Tenreiro, 2013, Boundary kernels for
distribution function estimation, REVSTAT-Statistical Journal, 11, 169-190). In this
note we return to the subject by considering an enlarged class of boundary kernels
that performs especially well when the classical kernel distribution function
estimator suffers from severe boundary problems.

1 Introduction


Given $X_1, \dots, X_n$ independent copies of an absolutely continuous real random variable
with unknown density and distribution functions $f$ and $F$, respectively, the classical kernel
estimator of $F$ introduced by authors such as Tiago de Oliveira (1963), Nadaraya (1964) or
Watson and Leadbetter (1964) is defined, for $x \in \mathbb{R}$, by

$$\bar F_{nh}(x) = \frac{1}{n} \sum_{i=1}^{n} \bar K\!\left(\frac{x - X_i}{h}\right), \tag{1}$$

where, for $u \in \mathbb{R}$,

$$\bar K(u) = \int_{-\infty}^{u} K(v)\,dv,$$

with $K$ a kernel on $\mathbb{R}$, that is, a bounded and symmetric probability density function
with support $[-1,1]$, and $h = h_n$ a sequence of strictly positive real numbers converging
to zero when $n$ goes to infinity. For some recent references on this classical estimator see
Giné and Nickl (2009), Chacón and Rodríguez-Casal (2010), Mason and Swanepoel (2011) and
Chacón, Monfort and Tenreiro (2014).
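As an illustration of estimator (1), the following minimal Python sketch evaluates it with the Epanechnikov kernel $K(t) = \frac{3}{4}(1 - t^2) I(|t| \le 1)$, for which the integrated kernel has the closed form $\bar K(u) = 1/2 + 3u/4 - u^3/4$ on $[-1,1]$. The function names, the sample values and the bandwidth are hypothetical choices made for this example only; they are not taken from the computations reported in this note.

```python
def epan_cdf(u):
    """Integrated Epanechnikov kernel K_bar(u) = 1/2 + 3u/4 - u^3/4 on [-1, 1]."""
    u = max(-1.0, min(1.0, u))  # K_bar equals 0 below -1 and 1 above 1
    return 0.5 + 0.75 * u - 0.25 * u ** 3


def classical_cdf_estimate(x, sample, h):
    """Estimator (1): average of K_bar((x - X_i) / h) over the sample."""
    return sum(epan_cdf((x - xi) / h) for xi in sample) / len(sample)


sample = [0.05, 0.12, 0.30, 0.41, 0.77]  # hypothetical observations on [0, 1]
h = 0.2                                   # hypothetical bandwidth

# For a distribution supported on [0, 1] we have F(0) = 0, yet the estimate
# at the left end point is strictly positive as soon as some X_i < h.
print(classical_cdf_estimate(0.0, sample, h))
```

With data close to the left end point the estimate at $x = 0$ is positive although $F(0) = 0$, which is the kind of boundary effect discussed next.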


If the support of $f$ is known to be the finite interval $[a,b]$, the previous kernel estimator
suffers from boundary problems if $F'_+(a) \neq 0$ or $F'_-(b) \neq 0$. This question is addressed
in Tenreiro (2013) by extending to the distribution function estimation framework the
approach followed in nonparametric regression and density function estimation by authors
such as Gasser and Müller (1979), Rice (1984), Gasser et al. (1985) and Müller (1991).


Specifically, the author considers the boundary modified kernel distribution function estimator
given by

$$\tilde F_{nh}(x) = \begin{cases}
0, & x \le a \\[4pt]
\dfrac{1}{n} \displaystyle\sum_{i=1}^{n} \bar K_{x,h}\!\left(\dfrac{x - X_i}{h}\right), & a < x < b \\[4pt]
1, & x \ge b,
\end{cases} \tag{2}$$

where $0 < h \le (b-a)/2$ and

$$\bar K_{x,h}(u) = \begin{cases}
\bar K^L(u; (x-a)/h), & a < x < a+h \\
\bar K(u), & a+h \le x \le b-h \\
\bar K^R(u; (b-x)/h), & b-h < x < b,
\end{cases}$$

with

$$\bar K^L(u;\alpha) = \int_{-\infty}^{u} K^L(v;\alpha)\,dv \quad\text{and}\quad \bar K^R(u;\alpha) = 1 - \int_{u}^{+\infty} K^R(v;\alpha)\,dv,$$

where $K^L(\cdot;\alpha)$ and $K^R(\cdot;\alpha)$ are, respectively, left and right boundary kernels for
$\alpha \in\, ]0,1[$, that is, their supports are contained in the intervals $[-1,\alpha]$ and
$[-\alpha,1]$, respectively, and $|\mu|_{0,\ell}(\alpha) = \int |K^\ell(u;\alpha)|\,du < \infty$ for all
$\alpha \in\, ]0,1[$ and $\ell = L, R$ (here and below integrals without integration limits are
meant over the whole real line).

For ease of presentation, from now on we assume that the right boundary kernel $K^R$ is
given by $K^R(u;\alpha) = K^L(-u;\alpha)$, which is why only the left boundary kernel is mentioned
in the following discussion. By assuming that $K^L(\cdot;\alpha)$ is a second order kernel, that is,

$$\mu_{0,L}(\alpha) = 1, \quad \mu_{1,L}(\alpha) = 0 \quad\text{and}\quad \mu_{2,L}(\alpha) \neq 0, \quad\text{for all } \alpha \in\, ]0,1[, \tag{3}$$

where we denote

$$\mu_{k,L}(\alpha) = \int u^k K^L(u;\alpha)\,du, \quad\text{for } k \in \mathbb{N},$$

Tenreiro (2013) shows that the previous estimator is free of boundary problems and that
the theoretical advantage of using boundary kernels is compatible with the natural property
of getting a proper distribution function estimate. In fact, it is easy to see that the kernel
distribution function estimator based on each one of the second order left boundary kernels

$$K_1^L(u;\alpha) = (2\bar K(\alpha) - 1)^{-1} K(u)\, I(-\alpha \le u \le \alpha), \tag{4}$$

where we assume that $K$ is such that $\int_0^{\alpha} K(u)\,du > 0$ for all $\alpha > 0$, and

$$K_2^L(u;\alpha) = K(u/\alpha)/\alpha, \tag{5}$$

is, with probability one, a continuous probability distribution function (see Tenreiro, 2013,
Examples 2.2 and 2.3). Additionally, the author shows that the Chung-Smirnov law of
iterated logarithm is valid for the new estimator and presents an asymptotic expansion
for its mean integrated squared error, from which the choice of $h$ is discussed (see Tenreiro,
2013, Theorems 3.2, 4.1 and 4.2).
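For the boundary kernel (5), together with the mirrored right boundary kernel $K^R(u;\alpha) = K^L(-u;\alpha)$, a change of variable gives $\bar K_2^L(u;\alpha) = \bar K_2^R(u;\alpha) = \bar K(u/\alpha)$, so that estimator (2) coincides with the classical estimator computed with the local bandwidth $\min(h, x-a, b-x)$. The Python sketch below relies on this observation. It is only an illustration under stated assumptions: the Epanechnikov kernel, the function names, the sample and the bandwidth are hypothetical choices for this example.

```python
def epan_cdf(u):
    """Integrated Epanechnikov kernel K_bar(u) = 1/2 + 3u/4 - u^3/4 on [-1, 1]."""
    u = max(-1.0, min(1.0, u))
    return 0.5 + 0.75 * u - 0.25 * u ** 3


def boundary_cdf_estimate(x, sample, h, a=0.0, b=1.0):
    """Estimator (2) with left boundary kernel (5) and its mirrored right kernel."""
    if not 0.0 < h <= (b - a) / 2.0:
        raise ValueError("h must satisfy 0 < h <= (b - a) / 2")
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    # In the boundary regions alpha * h equals x - a (left) or b - x (right);
    # in the interior the global bandwidth h is used.
    local_h = min(h, x - a, b - x)
    return sum(epan_cdf((x - xi) / local_h) for xi in sample) / len(sample)


sample = [0.05, 0.12, 0.30, 0.41, 0.77]  # hypothetical observations on [0, 1]
h = 0.2                                   # hypothetical bandwidth
for x in (0.0, 0.01, 0.1, 0.5, 0.9, 1.0):
    print(x, boundary_cdf_estimate(x, sample, h))
```

In this sketch the estimate is 0 at $a$, 1 at $b$ and nondecreasing in between, in line with the property of being a proper distribution function stated above.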


A careful analysis of the asymptotic expansions presented in Tenreiro (2013, pp. 171,
178) for the local bias and the integrated squared bias of estimator (1) suggests that
the previous properties may still be valid for all the boundary kernels satisfying the less
restrictive condition

$$\alpha\,(1 - \mu_{0,L}(\alpha)) + \mu_{1,L}(\alpha) = 0, \quad\text{for all } \alpha \in\, ]0,1[, \tag{6}$$

which is in particular fulfilled by the left boundary kernel

$$K_3^L(u;\alpha) = \alpha K(u)\, I(-1 \le u \le \alpha) \,/\, (\alpha\,\mu_{0,\alpha}(K) - \mu_{1,\alpha}(K)), \tag{7}$$

where we denote $\mu_{k,\alpha}(K) = \int_{-1}^{\alpha} u^k K(u)\,du$, for $k \in \mathbb{N}$ (see Figure 1). If $K$ is a
continuous density function, it is not hard to prove that the kernel distribution function
estimator based on this left boundary kernel is, with probability one, a continuous
probability distribution function.

Figure 1: Left boundary kernels $K_q^L(u;\alpha)$ (left column) and $\bar K_q^L(u;\alpha)$ (right column) for
$q = 1, 2, 3$, where $K$ is the Epanechnikov kernel $K(t) = \frac{3}{4}(1 - t^2) I(|t| \le 1)$.
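Condition (6) can be checked numerically for the three left boundary kernels (4), (5) and (7). The Python sketch below does so for the Epanechnikov kernel with a simple midpoint rule. The helper names (`left_kernel`, `moment`, `condition6`), the quadrature rule and the grid size are assumptions made for this illustration and are not part of the original computations.

```python
def epan(u):
    """Epanechnikov kernel K(t) = 3/4 (1 - t^2) on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def epan_cdf(u):
    """Integrated Epanechnikov kernel K_bar."""
    u = max(-1.0, min(1.0, u))
    return 0.5 + 0.75 * u - 0.25 * u ** 3


def integrate(g, lo, hi, m=4000):
    """Composite midpoint rule for the integral of g over [lo, hi]."""
    step = (hi - lo) / m
    return step * sum(g(lo + (j + 0.5) * step) for j in range(m))


def left_kernel(q, alpha):
    """Return (kernel, (lo, hi)); the kernel is meant to be evaluated on [lo, hi] only."""
    if q == 1:  # kernel (4): K truncated to [-alpha, alpha] and renormalised
        c = 1.0 / (2.0 * epan_cdf(alpha) - 1.0)
        return (lambda u: c * epan(u)), (-alpha, alpha)
    if q == 2:  # kernel (5): K rescaled to [-alpha, alpha]
        return (lambda u: epan(u / alpha) / alpha), (-alpha, alpha)
    # kernel (7): K truncated to [-1, alpha] and scaled so that (6) holds
    mu0 = integrate(epan, -1.0, alpha)
    mu1 = integrate(lambda u: u * epan(u), -1.0, alpha)
    c = alpha / (alpha * mu0 - mu1)
    return (lambda u: c * epan(u)), (-1.0, alpha)


def moment(q, alpha, k):
    """mu_{k,L}(alpha): integral of u^k K_q^L(u; alpha) over its support."""
    kern, (lo, hi) = left_kernel(q, alpha)
    return integrate(lambda u: u ** k * kern(u), lo, hi)


def condition6(q, alpha):
    """Left-hand side of condition (6); it should vanish up to quadrature error."""
    return alpha * (1.0 - moment(q, alpha, 0)) + moment(q, alpha, 1)


for q in (1, 2, 3):
    print(q, moment(q, 0.5, 0), condition6(q, 0.5))
```

Under these assumptions the left-hand side of (6) vanishes for all three kernels, while the total mass $\mu_{0,L}(\alpha)$ of kernel (7) differs from one, so (7) is not a second order kernel in the sense of (3).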

The main purpose of this note is to show that the results presented in Tenreiro (2013) for
the class of second order boundary kernels are still valid for the enlarged class of boundary
kernels that satisfy assumption (6). This objective is achieved in Sections 2 and 3, where
we study the boundary and global behaviour of the boundary modified kernel distribution
function estimator $\tilde F_{nh}$. In Section 4 we present exact finite sample comparisons between
the distribution function kernel estimators based on the left boundary kernels $K_q^L(u;\alpha)$,
for $q = 1, 2, 3$, given by (4), (5) and (7), respectively. We conclude that the boundary
kernel $K_3^L$ performs especially well when the classical kernel estimator suffers from severe
boundary problems. All the proofs can be found in Section 5. The plots and simulations
in this paper were carried out using the R software (R Development Core Team, 2011).


2 Boundary behaviour 


In this section we study the boundary behaviour of the kernel distribution function
estimator $\tilde F_{nh}(x)$ by presenting asymptotic expansions for its bias and variance with $x$ in
the boundary region. We will restrict our attention to the left boundary region $]a, a+h[$.
However, similar results are valid for the right boundary region $]b-h, b[$.

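Theorem 1. If $K^L(u;\alpha)$ satisfies condition (6) with

$$\sup_{\alpha \in\, ]0,1[} |\mu|_{0,L}(\alpha) < \infty,$$

and the restriction of $F$ to the interval $[a,b]$ is twice continuously differentiable, we have:

a)

$$\sup_{x \in\, ]a,a+h[} \left| \mathrm{E}\,\tilde F_{nh}(x) - F(x) - \frac{h^2}{2} F''(x)\, \mu_L((x-a)/h) \right| = o(h^2),$$

where

$$\mu_L(\alpha) = \mu_{2,L}(\alpha) - \alpha\,\mu_{1,L}(\alpha), \quad \alpha \in\, ]0,1[;$$

b)

$$\sup_{x \in\, ]a,a+h[} \left| \mathrm{Var}\,\tilde F_{nh}(x) - \frac{F(x)(1 - F(x))}{n} + \frac{h}{n} F'(x)\, \nu_L((x-a)/h) \right| = O(n^{-1}h^2),$$

where

$$\nu_L(\alpha) = m_{1,L}(\alpha) + \alpha\,(1 - \mu_{0,L}(\alpha)^2), \quad \alpha \in\, ]0,1[,$$

with $m_{1,L}(\alpha) = \int u B^L(u;\alpha)\,du$, and $B^L(u;\alpha) = 2 \bar K^L(u;\alpha) K^L(u;\alpha)$.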


Remark 1. The previous expansions for the bias and variance of $\tilde F_{nh}(x)$ extend those
presented in Tenreiro (2013, p. 174) for second order boundary kernels, in which case
$\mu_L(\alpha) = \mu_{2,L}(\alpha)$ and $\nu_L(\alpha) = m_{1,L}(\alpha)$, for $\alpha \in\, ]0,1[$.

Figure 2: Functions $\mu_L^2$ and $-\nu_L$ (plotted against $\alpha$) for the left boundary kernels $K_q^L$,
with $q = 1, 2, 3$, where $K$ is the Epanechnikov kernel.


Theorem 1 enables us to undertake a first asymptotic comparison between the boundary
kernels $K_q^L$, $q = 1, 2, 3$, given by (4), (5) and (7), respectively. In Figure 2 we plot the
functions $\mu_L^2$ and $-\nu_L$, which respectively correspond to the coefficients of the most
significant terms in the expansions of the local squared bias and variance of estimator
$\tilde F_{nh}(x)$ for $x$ in the left boundary region. We take for $K$ the Bartlett or Epanechnikov kernel
$K(t) = \frac{3}{4}(1 - t^2) I(|t| \le 1)$, but similar conclusions are valid for other polynomial kernels
such as the uniform (in this case $K_1^L = K_2^L$), the biweight or the triweight kernels (for the
definition of these kernels see Wand and Jones, 1995, p. 31).

From the plots we conclude that the boundary kernel $K_3^L$ has, uniformly over the boundary
region, the biggest asymptotic squared bias but also the lowest asymptotic variance
among the considered boundary kernels. The lowest asymptotic bias is obtained by $K_2^L$,
but this kernel also has the largest asymptotic variance among the considered kernels. We
postpone to Section 4 the analysis of the combined effect of bias and variance, which depends
on the underlying distribution $F$, especially through $F''(x)^2$ and $F'(x)$, which enter
as coefficients of the terms $\mu_L^2((x-a)/h)$ and $-\nu_L((x-a)/h)$, respectively, in the asymptotic
expansions stated in Theorem 1 for the bias and variance of $\tilde F_{nh}(x)$.
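The coefficients $\mu_L(\alpha)$ and $\nu_L(\alpha)$ of Theorem 1 can be approximated numerically for the three boundary kernels. The Python sketch below does this for the Epanechnikov kernel at a single value of $\alpha$. It is a hedged illustration: the helper names, the closed-form constants used for kernel (7), the midpoint rule and the grid size are assumptions of this example, not the procedure used to produce Figure 2.

```python
def epan(u):
    """Epanechnikov kernel K(t) = 3/4 (1 - t^2) on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def epan_cdf(u):
    """Integrated Epanechnikov kernel K_bar."""
    u = max(-1.0, min(1.0, u))
    return 0.5 + 0.75 * u - 0.25 * u ** 3


def left_kernel(q, alpha):
    """Return (kernel, lo) for K_q^L(.; alpha); its support is [lo, alpha]."""
    if q == 1:  # kernel (4)
        c = 1.0 / (2.0 * epan_cdf(alpha) - 1.0)
        return (lambda u: c * epan(u)), -alpha
    if q == 2:  # kernel (5)
        return (lambda u: epan(u / alpha) / alpha), -alpha
    # kernel (7), with closed-form truncated moments of the Epanechnikov kernel
    mu0 = epan_cdf(alpha)
    mu1 = 0.375 * alpha ** 2 - 0.1875 * alpha ** 4 - 0.1875
    c = alpha / (alpha * mu0 - mu1)
    return (lambda u: c * epan(u)), -1.0


def bias_variance_coefficients(q, alpha, m=4000):
    """Return (mu_L(alpha), nu_L(alpha)) of Theorem 1 via a midpoint rule."""
    kern, lo = left_kernel(q, alpha)
    step = (alpha - lo) / m
    cum = 0.0                      # running value of K_bar^L at the cell's left end
    mu0 = mu1 = mu2 = m1 = 0.0
    for j in range(m):
        u = lo + (j + 0.5) * step
        mass = kern(u) * step      # kernel mass of the current cell
        kbar = cum + 0.5 * mass    # K_bar^L at the cell midpoint
        mu0 += mass
        mu1 += u * mass
        mu2 += u * u * mass
        m1 += u * 2.0 * kbar * mass   # contribution to int u B^L(u; alpha) du
        cum += mass
    return mu2 - alpha * mu1, m1 + alpha * (1.0 - mu0 ** 2)


for q in (1, 2, 3):
    print(q, bias_variance_coefficients(q, 0.5))
```

At $\alpha = 0.5$ this sketch gives the ordering $\mu_L(K_2^L) < \mu_L(K_1^L) < \mu_L(K_3^L)$ and $\nu_L(K_2^L) < \nu_L(K_1^L) < \nu_L(K_3^L)$, which agrees with the comparison drawn from the plots (recall that the variance expansion involves $-\nu_L$).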


3 Global behaviour 

A widely used measure of the quality of the kernel estimator is the mean integrated squared
error given by

$$\begin{aligned}
\mathrm{MISE}(F;h) &= \mathrm{E} \int \{\tilde F_{nh}(x) - F(x)\}^2\,dx \\
&= \int \mathrm{Var}\,\tilde F_{nh}(x)\,dx + \int \{\mathrm{E}\,\tilde F_{nh}(x) - F(x)\}^2\,dx \\
&=: V(F;h) + B(F;h).
\end{aligned}$$

Next we extend Theorems 4.1 and 4.2 of Tenreiro (2013) by showing that the MISE
expansion obtained by Jones (1990) for the classical kernel estimator (1) is also valid for the
boundary modified kernel estimator (2) when the left boundary kernel satisfies condition
(6). As before we assume that the right boundary kernel $K^R$ is given by $K^R(u;\alpha) =
K^L(-u;\alpha)$, for $u \in \mathbb{R}$ and $\alpha \in\, ]0,1[$.


Theorem 2. If $K^L(u;\alpha)$ satisfies condition (6) with

$$\int_0^1 |\mu|_{0,L}(\alpha)^2\,d\alpha < \infty, \tag{8}$$

and the restriction of $F$ to the interval $[a,b]$ is twice continuously differentiable, we have:

$$V(F;h) = \frac{1}{n} \int F(x)(1 - F(x))\,dx - \frac{h}{n} \int u B(u)\,du + O(n^{-1}h^2)$$

and

$$B(F;h) = \frac{h^4}{4} \left( \int u^2 K(u)\,du \right)^2 \int F''(x)^2\,dx + o(h^4).$$

Moreover, if $F$ is not the uniform distribution function on $[a,b]$, the asymptotically optimal
bandwidth, in the sense of minimising the MISE expansion leading terms, is given by

$$h_0 = \delta(K) \left( \int F''(x)^2\,dx \right)^{-1/3} n^{-1/3},$$

where

$$\delta(K) = \left( \int u B(u)\,du \right)^{1/3} \left( \int u^2 K(u)\,du \right)^{-2/3}.$$
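To make the bandwidth formula of Theorem 2 concrete, the Python sketch below evaluates $h_0$ for the Epanechnikov kernel. It assumes that $B(u) = 2\bar K(u)K(u)$, by analogy with the definition of $B^L$ in Theorem 1, which gives $\int uB(u)\,du = 9/35$ and $\int u^2K(u)\,du = 1/5$ for this kernel. The function names and the test case (the Beta(2,2) distribution on $[0,1]$, for which $F''(x) = 6 - 12x$ and $\int F''(x)^2dx = 12$) are hypothetical choices for this example.

```python
def delta_epanechnikov():
    """delta(K) for the Epanechnikov kernel, from closed-form kernel constants."""
    psi = 9.0 / 35.0   # int u B(u) du, assuming B(u) = 2 K_bar(u) K(u)
    mu2 = 1.0 / 5.0    # int u^2 K(u) du
    return psi ** (1.0 / 3.0) * mu2 ** (-2.0 / 3.0)


def optimal_bandwidth(roughness, n):
    """h_0 = delta(K) (int F''(x)^2 dx)^(-1/3) n^(-1/3), as in Theorem 2."""
    return delta_epanechnikov() * (roughness * n) ** (-1.0 / 3.0)


def mise_leading_terms(h, roughness, n):
    """The h-dependent leading terms of V(F; h) + B(F; h) (Epanechnikov kernel)."""
    return -h / n * (9.0 / 35.0) + h ** 4 / 4.0 * (1.0 / 5.0) ** 2 * roughness


# Illustrative case: Beta(2, 2) on [0, 1], where int F''(x)^2 dx = 12.
n = 50
h0 = optimal_bandwidth(12.0, n)
print(h0)
print(mise_leading_terms(0.9 * h0, 12.0, n),
      mise_leading_terms(h0, 12.0, n),
      mise_leading_terms(1.1 * h0, 12.0, n))
```

In this illustrative case $h_0$ stays below $(b-a)/2 = 0.5$, as required in (2), and the leading terms are smaller at $h_0$ than at nearby bandwidths.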


A classical measure of the performance of a distribution function estimator is the supremum
distance between the estimator and the underlying distribution function $F$. Next
we extend Theorems 3.1 and 3.2 of Tenreiro (2013) by establishing the almost complete
uniform convergence and the Chung-Smirnov law of iterated logarithm for kernel estimator
(2). These properties were first obtained for estimator (1) by Nadaraya (1964),
Winter (1973, 1979) and Yamato (1973). We denote by $\|\cdot\|$ the supremum norm.

Theorem 3. If $K^L(u;\alpha)$ is such that

$$\sup_{\alpha \in\, ]0,1[} |\mu|_{0,L}(\alpha) < \infty,$$

we have

$$\|\tilde F_{nh} - F\| \to 0 \quad\text{almost completely.}$$

Additionally, if $F$ is Lipschitz on $[a,b]$ and $(n/\log\log n)^{1/2} h \to 0$, then $\tilde F_{nh}$ has the
Chung-Smirnov property, i.e.,

$$\limsup_{n \to \infty}\, (2n/\log\log n)^{1/2}\, \|\tilde F_{nh} - F\| \le 1 \quad\text{almost surely.}$$

The same is true under the less restrictive condition $(n/\log\log n)^{1/2} h^2 \to 0$, whenever $K^L$
satisfies (6) and $F'$ is Lipschitz on $[a,b]$.

Remark 2. The asymptotically optimal bandwidth $h_0$ given in Theorem 2 satisfies condition
$(n/\log\log n)^{1/2} h^2 \to 0$, but not condition $(n/\log\log n)^{1/2} h \to 0$.
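The claim in Remark 2 follows from a direct rate computation, which we spell out as a short worked step. Writing $h_0 = c\,n^{-1/3}$, where $c > 0$ denotes the constant $\delta(K)(\int F''(x)^2dx)^{-1/3}$ of Theorem 2 (the symbol $c$ is introduced here only for this calculation):

```latex
\begin{align*}
\left(\frac{n}{\log\log n}\right)^{1/2} h_0^2
  &= c^2\, n^{1/2 - 2/3}\, (\log\log n)^{-1/2}
   = c^2\, n^{-1/6}\, (\log\log n)^{-1/2} \longrightarrow 0, \\
\left(\frac{n}{\log\log n}\right)^{1/2} h_0
  &= c\, n^{1/2 - 1/3}\, (\log\log n)^{-1/2}
   = c\, n^{1/6}\, (\log\log n)^{-1/2} \longrightarrow +\infty .
\end{align*}
```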

4 Exact finite sample comparisons 

In this section we compare the boundary performance of the kernel estimator $\tilde F_{nh}$ when we
take for $K^L$ one of the left boundary kernels given by (4), (5) and (7), respectively. For that,
we have used as test distributions some beta mixtures of the form $wB(1,2) + (1-w)B(2,b)$,
where $w \in [0,1]$ and the shape parameter $b$ is such that $b \ge 2$. Four values of $w$, namely
$w = 0, 0.25, 0.5, 0.75$, were considered, which lead to distributions with $F'_+(0) = 0, 0.5, 1, 1.5$,
respectively. For each one of the previous weights $w$, two values for the shape parameter $b$
were taken in order to get a second order derivative $F''(0)$ equal to 6 and 30. The considered
set of test distributions is shown in Figure 3.
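The shape parameters are not listed explicitly, but they can be recovered from the stated targets. Using the standard beta densities, the mixture density is $f(x) = 2w(1-x) + (1-w)\,b(b+1)\,x(1-x)^{b-1}$ on $[0,1]$, so that $F'_+(0) = 2w$ and $F''(0) = -2w + (1-w)\,b(b+1)$. The Python sketch below solves the latter equation for $b$. This derivation and the function names are our own illustration; the resulting values of $b$ are computed here and are not quoted from the original text.

```python
import math


def density(x, w, b):
    """Density of the mixture w B(1, 2) + (1 - w) B(2, b) on [0, 1]."""
    return 2.0 * w * (1.0 - x) + (1.0 - w) * b * (b + 1.0) * x * (1.0 - x) ** (b - 1.0)


def shape_for_target(w, target):
    """Positive root b of -2w + (1 - w) b (b + 1) = target, i.e. F''(0) = target."""
    c = (target + 2.0 * w) / (1.0 - w)   # required value of b (b + 1)
    return (-1.0 + math.sqrt(1.0 + 4.0 * c)) / 2.0


for w in (0.0, 0.25, 0.5, 0.75):
    # density(0, w, b) equals F'_+(0) = 2w whatever the value of b
    print(w, density(0.0, w, 2.0), shape_for_target(w, 6.0), shape_for_target(w, 30.0))
```

For $w = 0$ the targets 6 and 30 correspond to $b = 2$ and $b = 5$, which is consistent with the constraint $b \ge 2$.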

Figure 3: Beta mixture densities $wB(1,2) + (1-w)B(2,b)$ with $F'_+(0) = 0, 0.5, 1, 1.5$ and
$F''(0) = 6$ (left column) and $F''(0) = 30$ (right column).

Figure 4: $\mathrm{MSE}(\alpha)$ for $K_q^L$, $q = 1, 2, 3$, with $K$ the Epanechnikov kernel, where $F$ is the beta
mixture distribution $wB(1,2) + (1-w)B(2,b)$ with $F'_+(0) = 0, 0.5, 1, 1.5$, $F''(0) = 6$ (left
column) and $F''(0) = 30$ (right column). The sample size is $n = 50$.

Figure 5: ISE distributions for the boundary corrected estimators with left boundary kernels
$K_q^L$, $q = 1, 2, 3$, and for the classical estimator with kernel $K$ over the regions $[0,h]$ (left),
$[1-h,1]$ (center) and $[0,1]$ (right). $F$ is the beta mixture distribution $wB(1,2) + (1-w)B(2,b)$
with $F'_+(0) = 1.5$ and $F''(0) = 6$. The boxplots are based on 500 generated
samples of size $n = 50$ and $K$ is the Epanechnikov kernel.

For each one of these test distributions we present in Figure 4 the exact mean square
error of $\tilde F_{nh}(x)$, for $x = a + \alpha h$ (here $a = 0$) and $\alpha \in\, ]0,1[$, given by

$$\mathrm{MSE}(\alpha) = V(\alpha) + B(\alpha)^2,$$

where

$$nV(\alpha) := n\,\mathrm{Var}\,\tilde F_{nh}(a + \alpha h) = \int F(a + (\alpha - u)h)\, B^L(u;\alpha)\,du - \left(\mathrm{E}\,\tilde F_{nh}(a + \alpha h)\right)^2$$

and

$$B(\alpha) := \mathrm{E}\,\tilde F_{nh}(a + \alpha h) - F(a + \alpha h) = \int F(a + (\alpha - u)h)\, K^L(u;\alpha)\,du - F(a + \alpha h)$$

(on these expressions see Section 5 below). The global bandwidth $h$ that determines the
boundary region was always taken equal to the asymptotically optimal bandwidth $h_0$ given
in Theorem 2 and we have considered the sample size $n = 50$. Similar pictures were
generated for sample sizes $n = 100$ and $n = 200$, but they were not included here to save
space. As before, we have taken for $K$ the Epanechnikov kernel.
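The two integrals defining $V(\alpha)$ and $B(\alpha)$ can be approximated by numerical quadrature. The Python sketch below does so for the Epanechnikov-based kernels (4), (5) and (7) and a beta mixture distribution function. It is a hedged illustration only: the helper names, the midpoint rule, the grid size and the bandwidth $h = 0.2$ used in the example call are assumptions of this sketch (in the comparisons above $h$ is set to $h_0$), and the sketch is not the code used to produce Figure 4.

```python
def epan(u):
    """Epanechnikov kernel K(t) = 3/4 (1 - t^2) on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def epan_cdf(u):
    """Integrated Epanechnikov kernel K_bar."""
    u = max(-1.0, min(1.0, u))
    return 0.5 + 0.75 * u - 0.25 * u ** 3


def left_kernel(q, alpha):
    """Return (kernel, lo) for K_q^L(.; alpha); its support is [lo, alpha]."""
    if q == 1:  # kernel (4)
        c = 1.0 / (2.0 * epan_cdf(alpha) - 1.0)
        return (lambda u: c * epan(u)), -alpha
    if q == 2:  # kernel (5)
        return (lambda u: epan(u / alpha) / alpha), -alpha
    # kernel (7), with closed-form truncated moments of the Epanechnikov kernel
    mu0 = epan_cdf(alpha)
    mu1 = 0.375 * alpha ** 2 - 0.1875 * alpha ** 4 - 0.1875
    c = alpha / (alpha * mu0 - mu1)
    return (lambda u: c * epan(u)), -1.0


def mixture_cdf(x, w, b):
    """Distribution function of w B(1, 2) + (1 - w) B(2, b) on [0, 1]."""
    x = max(0.0, min(1.0, x))
    beta_1_2 = 1.0 - (1.0 - x) ** 2
    beta_2_b = 1.0 - (1.0 - x) ** b * (1.0 + b * x)
    return w * beta_1_2 + (1.0 - w) * beta_2_b


def exact_mse(q, alpha, h, n, cdf, a=0.0, m=4000):
    """Return (MSE, bias, variance) at x = a + alpha h, using a midpoint rule."""
    kern, lo = left_kernel(q, alpha)
    step = (alpha - lo) / m
    cum = mean = second = 0.0
    for j in range(m):
        u = lo + (j + 0.5) * step
        mass = kern(u) * step               # kernel mass of the current cell
        kbar = cum + 0.5 * mass             # K_bar^L at the cell midpoint
        f_val = cdf(a + (alpha - u) * h)
        mean += f_val * mass                # int F(a + (alpha - u) h) K^L du
        second += f_val * 2.0 * kbar * mass # int F(a + (alpha - u) h) B^L du
        cum += mass
    bias = mean - cdf(a + alpha * h)
    variance = (second - mean ** 2) / n
    return variance + bias ** 2, bias, variance


# Example: w = 0.75 and b = 5, i.e. F'_+(0) = 1.5 and F''(0) = 6.
for q in (1, 2, 3):
    print(q, exact_mse(q, 0.5, 0.2, 50, lambda x: mixture_cdf(x, 0.75, 5.0)))
```

As a sanity check of the sketch, when `cdf` is the uniform distribution function on $[0,1]$ the bias vanishes (up to quadrature error) for every kernel satisfying condition (6).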

From the graphics we conclude that the boundary behaviour of the kernel estimator
based on the boundary kernels $K_q^L$, for $q = 1, 2, 3$, is dominated by the magnitude of the
underlying density $f = F'$ over the boundary region. For large values of $F'_+(0)$ we see
that the boundary kernel $K_3^L$ is superior to both $K_1^L$ and $K_2^L$, the advantage over
the second order boundary kernels being bigger for large than for small values of $F''(0)^2$. Notice
that this latter conclusion is in accordance with the asymptotic comparisons presented in
Section 2. Although it performs less well than $K_3^L$, the kernel $K_1^L$ is, in this case, superior to
$K_2^L$. When the underlying density is such that $F'_+(0) = 0$, in which case the classical kernel
estimator does not suffer from boundary problems, we see that the boundary kernels $K_1^L$
and $K_2^L$ perform similarly, both being slightly better than $K_3^L$. Finally, for intermediate
values of $F'_+(0)$ the three considered left boundary kernels perform equally well. Based
on this analysis, we conclude that none of the considered boundary kernels is the best over
the considered set of test distributions. However, the kernel $K_3^L$ proves to be particularly
interesting because it performs especially well when the classical kernel estimator
suffers from severe boundary problems.

We finish this section with a cautionary note that aims to call the attention of the
reader to the fact that, due to the continuity of $F$ on $\mathbb{R}$, the boundary effects for kernel
distribution function estimation may not have the same impact on the global performance
of the estimator as in probability density or regression function estimation frameworks (see
Gasser and Müller, 1979). However, one may have cases where the local behaviour dominates
the global behaviour of the estimator, which stresses the relevance of using boundary
corrections for the classical kernel distribution function estimator. We illustrate this fact
by taking the above considered beta mixture distribution with $F'_+(0) = 1.5$ and $F''(0) = 6$
(see Figure 3). In Figure 5 we present the empirical distribution of the integrated square
error of the classical estimator with kernel $K$ and of the boundary corrected estimators
with boundary kernels $K_q^L$, $q = 1, 2, 3$, over the boundary regions $[0,h]$ (left boundary ISE)
and $[1-h,1]$ (right boundary ISE), and over the whole interval $[0,1]$ (ISE). The boxplots are
based on 500 generated samples of size $n = 50$. We conclude that the local behaviour of
the estimator over the left boundary region has a clear impact on the global performance
of the estimator, which supports the use of boundary corrections for the classical kernel
distribution function estimator.


5 Proofs 


We limit ourselves to presenting the proof of Theorem 1. The proofs of Theorems 2 and 3
follow straightforwardly from the proofs of the corresponding results given in Tenreiro (2013)
and the asymptotic expansions for bias and variance of $\tilde F_{nh}(x)$ we present below.

Proof of Theorem 1 a): For $x \in\, ]a, a+h[$, the expectation of $\tilde F_{nh}(x)$ is given by

$$\mathrm{E}\,\tilde F_{nh}(x) = \int F(x - uh)\, K^L(u; (x-a)/h)\,du$$

(see Tenreiro, 2013, p. 186). By the continuity of the second derivative of $F$ on $[a,b]$ and
Taylor's formula, we have

$$F(x - uh) = F(x) - uhF'(x) + u^2h^2 \int_0^1 (1-t) F''(x - tuh)\,dt, \tag{9}$$

for $-1 \le u \le (x-a)/h$, from which we deduce that

$$\mathrm{E}\,\tilde F_{nh}(x) - F(x) - \frac{h^2}{2} F''(x)\,\mu_L((x-a)/h) = A(x,h) + B(x,h),$$

where

$$\begin{aligned}
A(x,h) ={}& F(x)\big(\mu_{0,L}((x-a)/h) - 1\big) - hF'(x)\,\mu_{1,L}((x-a)/h) \\
&+ \frac{h^2}{2} F''(x)\,((x-a)/h)\,\mu_{1,L}((x-a)/h)
\end{aligned} \tag{10}$$

and

$$B(x,h) = h^2 \int\!\!\int_0^1 (1-t)\big(F''(x - tuh) - F''(x)\big)\,dt\; u^2 K^L(u; (x-a)/h)\,du$$

is such that

$$\sup_{x \in\, ]a,a+h[} |B(x,h)| \le \frac{h^2}{2} \sup_{\alpha \in\, ]0,1[} |\mu|_{0,L}(\alpha) \sup_{y,z \in [a,b]:\, |y-z| \le h} |F''(y) - F''(z)|. \tag{11}$$

On the other hand, taking into account that $F(a) = 0$ and using condition (6) and the
Taylor expansions

$$F(x) = (x-a)F'(a) + \frac{1}{2}(x-a)^2 F''(a) + (x-a)^2 \int_0^1 (1-t)\big(F''(a + (x-a)t) - F''(a)\big)\,dt \tag{12}$$

and

$$F'(x) = F'(a) + (x-a)F''(a) + (x-a) \int_0^1 \big(F''(a + (x-a)t) - F''(a)\big)\,dt, \tag{13}$$

we get

$$\sup_{x \in\, ]a,a+h[} |A(x,h)| \le h^2 \sup_{\alpha \in\, ]0,1[} |\mu|_{0,L}(\alpha) \sup_{y,z \in [a,b]:\, |y-z| \le h} |F''(y) - F''(z)|. \tag{14}$$

Part a) of Theorem 1 follows now from (10), (11) and (14), and the fact that

$$\sup_{y,z \in [a,b]:\, |y-z| \le h} |F''(y) - F''(z)| = o(1).$$

Proof of Theorem 1 b): From Part a), the variance of $\tilde F_{nh}(x)$ is given by

$$\begin{aligned}
n\,\mathrm{Var}\,\tilde F_{nh}(x) &= \int \bar K^L(u; (x-a)/h)^2\, h f(x - uh)\,du - \big(\mathrm{E}\,\tilde F_{nh}(x)\big)^2 \\
&= F(x)(1 - F(x)) + C(x,h) + O(h^2),
\end{aligned}$$

uniformly in $x \in\, ]a, a+h[$, where

$$C(x,h) = \int \bar K^L(u; (x-a)/h)^2\, h f(x - uh)\,du - F(x).$$

Moreover, using (9) and the fact that

$$\lim_{u \to -\infty} \bar K^L(u;\alpha) = 0 \quad\text{and}\quad \lim_{u \to +\infty} \bar K^L(u;\alpha) = \mu_{0,L}(\alpha), \quad\text{for } \alpha \in\, ]0,1[,$$

we deduce that

$$\begin{aligned}
C(x,h) ={}& F(x)\big(\mu_{0,L}((x-a)/h)^2 - 1\big) - hF'(x)\, m_{1,L}((x-a)/h) \\
&+ h^2 \int\!\!\int_0^1 (1-t) F''(x - tuh)\,dt\; u^2 B^L(u; (x-a)/h)\,du \\
={}& F(x)\big(\mu_{0,L}((x-a)/h)^2 - 1\big) - hF'(x)\, m_{1,L}((x-a)/h) + O(h^2),
\end{aligned} \tag{15}$$

uniformly in $x \in\, ]a, a+h[$, as $\sup_{\alpha \in\, ]0,1[} \int |u^2 B^L(u;\alpha)|\,du < \infty$.

Finally, from (15) and Taylor's expansions (12) and (13) we get

$$\sup_{x \in\, ]a,a+h[} \big| C(x,h) + hF'(x)\,\nu_L((x-a)/h) \big| = O(h^2),$$

which concludes the proof. ■
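For the reader's convenience, the step leading from the Taylor expansions (12) and (13) to the bound (14) can be made explicit. The following worked computation is our own filling-in of that step, not part of the original argument. We write $\alpha = (x-a)/h$ and introduce, for this calculation only, $D(t) = F''(a + (x-a)t) - F''(a)$, so that $F''(x) = F''(a) + D(1)$. Condition (6) gives $\mu_{0,L}(\alpha) - 1 = \mu_{1,L}(\alpha)/\alpha$, and substituting (12) and (13) into (10) the terms in $F'(a)$ and $F''(a)$ cancel:

```latex
\begin{align*}
A(x,h)
 &= \frac{\mu_{1,L}(\alpha)}{\alpha}
    \Big( \alpha h F'(a) + \tfrac12 \alpha^2 h^2 F''(a)
          + \alpha^2 h^2 \int_0^1 (1-t) D(t)\,dt \Big) \\
 &\quad - h\,\mu_{1,L}(\alpha)
    \Big( F'(a) + \alpha h F''(a) + \alpha h \int_0^1 D(t)\,dt \Big)
    + \frac{h^2}{2}\,\alpha\,\mu_{1,L}(\alpha)\big( F''(a) + D(1) \big) \\
 &= \alpha h^2 \mu_{1,L}(\alpha)
    \Big( \tfrac12 D(1) - \int_0^1 t\,D(t)\,dt \Big)
  = \alpha h^2 \mu_{1,L}(\alpha) \int_0^1 t\,\big( D(1) - D(t) \big)\,dt .
\end{align*}
```

Since $|\mu_{1,L}(\alpha)| \le |\mu|_{0,L}(\alpha)$ (the support of $K^L(\cdot;\alpha)$ is contained in $[-1,1]$) and $|D(1) - D(t)| \le \sup_{|y-z| \le h} |F''(y) - F''(z)|$, this yields a bound of the form (14).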